MacTech 1 to 12

/ MacTech 1 to 12 / MacTech-vol-1-12.toast / Source / MacTech® Magazine / Volume 12 - 1996 / 12.03 Mar 96 / Multiprocessing.text < prev

Text File | 1996-01-30 | 16.0 KB | 199 lines | [TEXT/R*ch]

DAYSTAR DIGITAL

Overview

DayStar’s new MP systems are standard Macintoshes, with one major exception: they contain more than one CPU. The Apple MP API, which was designed in conjunction with DayStar, defines a set of services that allows developers to create and communicate with multiple elements of execution called ‘tasks’. When tasks are run on a multiprocessor system, they are scheduled and run simultaneously on all the available processors.

Task creation is accomplished by providing a pointer to a function already defined within existing application code. The most obvious advantage of this approach is that you can use existing tools and build processes to construct an MP-aware application. No special compilers or packaging of the task code are required.

Tasks have complete access to all the memory in the system. If an application has retrieved and prepared data for processing, it can simply tell the tasks where the data is. It is not necessary to move any data to specialized task-only memory, which avoids expensive transactions over system buses.

According to the Apple MP API specification, the processors in an MP system must be cache-coherent. This means that the developer need not be concerned with the possibility that data stored in the cache of one processor has not yet been written to main memory. If any other processor accesses that memory, the MP hardware will automatically ensure that the value cached within the first processor is retrieved, rather than the stale value in main memory. The MP API’s assumption of cache-coherency makes programming significantly easier; programming non-cache-coherent systems is far more error-prone and is not for the faint of heart.

Tasks run preemptively on all systems, including those with a single processor. If an application is willing to require the presence of PowerPC hardware and the shared library that provides the MP API services, the creation of MP-aware applications can be greatly simplified.
The application simply creates tasks and distributes the work accordingly. The tasks created could do all the work while the application checks for user events and controls the flow of data.

The MP API is Apple system software. It will be carried forward into Copland and is in fact a subset of the Copland tasking model.

Even though tasks and applications share the same memory, it is very important that they communicate, at least initially, via one of the three communication primitives provided: message queues, semaphores and critical regions. Communicating via these primitives ensures that all prior memory accesses made by the sender are completed before the recipient starts using those locations, i.e. that shared resources are accessed atomically. Using the communication primitives also provides a method by which a task can yield time if it has to wait for something that is not yet available.

Task Communication

There are three main inter-task communication mechanisms.

The first is the message queue. Message queues are first-in-first-out queues of 96-bit messages. Messages are useful for telling a task what work to do and where to look for information relevant to the request being made, such as a pointer into main memory. They are also useful for indicating that a given request has been processed and, if necessary, what the results are. Message queues incur more overhead than the other two communication primitives. If you cannot avoid frequent synchronization, at least try to use a semaphore instead of a message queue.

Semaphores store a value between 0 and some arbitrary positive integer maximum. The value in a semaphore can be raised and lowered, but never below 0 and never above the semaphore’s maximum value. Semaphores are useful for keeping track of how many occurrences of a particular thing are available for use.
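
The value rules for a semaphore described above can be sketched in a few lines of portable C. This is only a single-threaded model of the counter: the names ModelSemaphore, model_sem_create, model_sem_raise and model_sem_lower are hypothetical and are not part of the MP API, and where this model returns false on a lower at 0, a real semaphore would make the caller wait.

```c
#include <stdbool.h>

/* Hypothetical model of a semaphore's value rules; NOT the MP API. */
typedef struct {
    unsigned value;    /* current count, always within 0..maximum */
    unsigned maximum;  /* upper bound fixed at creation time */
} ModelSemaphore;

/* Create a model semaphore; the initial value is clamped to the maximum. */
static ModelSemaphore model_sem_create(unsigned maximum, unsigned initial)
{
    ModelSemaphore s;
    s.maximum = maximum;
    s.value = (initial > maximum) ? maximum : initial;
    return s;
}

/* Raise the value by one. Returns false, leaving the value unchanged,
   if the semaphore is already at its maximum. */
static bool model_sem_raise(ModelSemaphore *s)
{
    if (s->value >= s->maximum)
        return false;
    s->value++;
    return true;
}

/* Lower the value by one. Returns false if the value is already 0;
   a real semaphore would block the caller here instead of failing. */
static bool model_sem_lower(ModelSemaphore *s)
{
    if (s->value == 0)
        return false;
    s->value--;
    return true;
}
```

With a maximum of 1, the same model behaves as the binary case: one raise succeeds, a second raise is refused, and one lower returns the value to 0.
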
Binary semaphores, which have a maximum value of 1, are especially efficient mechanisms for indicating to some other task that something is ready. When a task or application has finished preparing data at some previously agreed-upon location, it raises the value of a binary semaphore, which the target task can be awaiting. The target task lowers the value of the semaphore, performs any necessary processing, and raises the value of a different binary semaphore to indicate that it is done with the data. This technique can be used in place of the message queue pairs described above when using the “Divide And Conquer” technique. MPCreateBinarySemaphore() is a macro that exists to simplify the creation of binary semaphores.

Critical regions are used to ensure that no more than one task (or the application) is executing a given “region” of code at any given time. For example, suppose part of a task’s job is to search a tree and modify it before proceeding with its primary work. If multiple tasks were allowed to search and modify the tree at the same time, the tree would quickly become corrupted. An easy way to avoid the problem is to form a critical region around the tree searching and modification code. When a task tries to enter the critical region, it will be able to do so only if no other task is currently in it, which preserves the integrity of the tree.

Cost

The cost of the DayStar Genesis system, which comes with four 604 processors and a minimum of 16MB of RAM and a 1GB hard disk, will range from $10,000 to $15,000.

Sample Code

The sample code uses two queues as the communication mechanism between tasks. Each task has a receive queue for messages from the application, and the application has a global queue for messages from the tasks. When work is being done by the tasks, the front end could either block on its queue, or poll the queue and call WaitNextEvent().
When a task finishes a segment of the fractal image, it sends the results back to the front end and blocks on its queue for another segment to process.

    err = noErr;
    if( !MPLibraryIsLoaded() )          /* Check that the MP library is present */
        err = 1;

    /* Check that the library is compatible with our header */
    if( (err == noErr) && !MPLibraryIsCompatible() )
        err = 1;

    if( err == noErr )
        numProcessors = MPProcessors();
    else
        numProcessors = 1;              /* Only use the host processor */

    /* Allocate memory for each processor (each task) */
    gTaskData = (TaskData *) NewPtrClear( numProcessors * sizeof (TaskData) );
    assert(gTaskData != NULL);          /* Handle the error better than this */

    /* Allocate a queue for the main application to wait on */
    err = MPCreateQueue( &gMainAppQueue );
    assert(err == noErr);               /* Handle the error better than this */

    /* Allocate a send queue and a task for each processor */
    err = noErr;
    for( i = 0; i < numProcessors && err == noErr; i++ )
    {
        /* Queue on which the application sends messages to this task */
        err = MPCreateQueue( &gTaskData[i].appToTask );
        assert(err == noErr);           /* Handle the error better than this */

        /* Every task replies on the application's single global queue */
        gTaskData[i].taskToAppQueue = gMainAppQueue;

        /* Create a task from the function fTask() */
        err = MPCreateTask( fTask, &gTaskData[i], 2048, NULL, NULL, NULL,
                            0, &gTaskData[i].taskID );
        assert(err == noErr);

        fSendMessage( gTaskData[i].appToTask, kTMCreate );

        /* We get an immediate reply to our kTMCreate message */
        fReceiveMessage( gMainAppQueue, &message );
    }

    /* The main application loop now posts action commands to each task */
    /* queue, then blocks on its receive queue (gMainAppQueue) until a  */
    /* task has finished a segment of the image. When all segments are  */
    /* rendered, a terminate message is sent and each task quits.       */

    /* This is the task code that runs on each processor */
    /* The variable “p” was passed in at creation time to the task */
    finished = false;
    while( !finished )
    {
        fReceiveMessage( p->appToTask, &message );
        switch( message )
        {
            case kTMCreate:
                break;
            case kTMRun:
                main( &p->zc, &p->zd, &p->step, &p->escape, p->width, p->results );
                break;
            case kTMQuit:
                finished = true;
                break;
        }
        fSendMessage( gMainAppQueue, kTMReady );
    }
    return( noErr );

YARC SYSTEMS

Overview

The YARC environment uses both hardware and software to achieve multiprocessing. YARC offers plug-in accelerator cards for PCI and NuBus systems which contain one or two 80MHz 601 processors and onboard RAM that also runs at 80MHz. In concept, the cards may be compared to a number of independent, tightly coupled, networked machines where the network is the peripheral device bus. In the PCI implementation of the boards, this type of networked connection becomes even more powerful because of the high bandwidth of PCI.

Because each board has live processors with fast local memory, the multiprocessing provided by the YARC environment is under full application control, with no operating system scheduling and running tasks. This offers developers a “real time” acceleration engine where CPU cycles can be closely accounted for and controlled by an application’s code. If the full bandwidth of the processors is not used, YARC also provides a thread manager capable of running multiple threads (or tasks) on any remote processor. This multiprocessing is cooperatively (or voluntarily) scheduled, the same model implemented by the PowerPC Thread Manager on the Macintosh.

The YARC multiprocessing environment therefore offers fast, guaranteed access to remote CPU horsepower, with the ability to fine tune processor load by adding scheduled multiprocessing for any of the attached board processors.
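
Cooperative (voluntary) scheduling, as mentioned above, means a thread keeps the processor until it chooses to give it up. The following is a minimal sketch of that idea in portable C, not YARC’s thread manager or the Macintosh Thread Manager: CoopThread, coop_step and coop_run_all are hypothetical names, and returning from coop_step() stands in for the voluntary yield.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cooperative "thread": a counter of work slices left. */
typedef struct {
    int remaining;   /* slices of work still to do */
    int slicesRun;   /* slices completed so far */
} CoopThread;

/* Run one slice of work, then return. Returning IS the voluntary yield:
   nothing can interrupt the thread in the middle of a slice.
   Returns true if the thread still has work left. */
static bool coop_step(CoopThread *t)
{
    if (t->remaining > 0) {
        t->remaining--;
        t->slicesRun++;
    }
    return t->remaining > 0;
}

/* Round-robin over the threads until none has work left.
   Returns the number of passes made over the thread list. */
static int coop_run_all(CoopThread *threads, size_t count)
{
    int passes = 0;
    bool anyActive = true;

    while (anyActive) {
        anyActive = false;
        for (size_t i = 0; i < count; i++) {
            /* Finished threads are skipped; active ones get one slice. */
            if (threads[i].remaining > 0 && coop_step(&threads[i]))
                anyActive = true;
        }
        passes++;
    }
    return passes;
}
```

The weakness of this model is visible in the sketch: a thread whose slice never returns starves every other thread, which is why cycle accounting is left to the application.
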
Because the YARC system isn’t tightly coupled to the Mac OS, creating “tasks” for scheduled execution involves a special development environment. This package costs $495 and is built around the GNU C compiler. YARC is working on a PEF loader which would eliminate the need for a custom development setup.

Cost

Boards start at $2,995 with one 80MHz 601 CPU and 8MB of RAM. The most powerful board is currently the two-processor HYDRA board, with 128MB of RAM. This board tops out at $13,000.

Sample Code

    #define MAXBOARDS 16
    static Board *board[MAXBOARDS];
    ...
    y_configure();      /* Initialize the environment and boards */

    if ((vfd = vio_open("AppToLoad.ppc", VO_RDONLY)) < 0)
        vioerror("AppToLoad.ppc");

    err = noErr;
    numBoards = 0;
    /* Check the bound first so board[MAXBOARDS] is never written */
    while (numBoards < MAXBOARDS &&
           (board[numBoards] = y_open(0,0)) != NULL)
    {
        if ((err = yk_loadkernel(board[numBoards])) != noErr) {
            yerror(board[numBoards], "Unable to load YARC PPC kernel");
            break;
        }
        if (yk_loadxcoff(board[numBoards], vfd, &info) < 0) {
            yerror(board[numBoards], "Unable to load PPC code to board");
            break;
        }
        numBoards++;
    }
    vio_close(vfd);

    for (k = 0; k < numBoards; k++)
    {
        err = yk_setargs(board[k], &info, NULL, NULL);
        err = yio_init(board[k], 0, 1, 2);      /* Init stdio */
        err = ykiret(board[k]);                 /* Start task code */
    }

POWERTAP

Overview

PowerTap is a software library that runs on all Macintosh models. It can assign work to all processors on all Macintoshes connected by a network. PowerTap simplifies multiprocessing by performing all of the scheduling, task management and error recovery, presenting itself to the host software as a simple black box where tasks are submitted and results are retrieved.

Candidate applications are those that are computationally intense and can be divided into independent pieces. PowerTap is intended for jobs that take more than a couple of seconds, although shorter jobs are practical when using attached processors. The assumption is that any job that computes for a minute or an hour must be looping in some way.
Typically, it is working on each pixel/band/timeslice/piece in a similar manner. So the developer takes the contents of such an existing loop and moves that code into a DoTask() function, rather than restructuring the entire application.

To use PowerTap, a developer divides a job into multiple, independent pieces referred to as “tasks”. [PowerTap tasks are different from Apple’s notion of an MP task. PowerTap tasks refer to data, such as one tile or band of an image.] No task may depend on the results of other tasks in the same job. The host must supply a function, called DoTask(), that can perform any of the tasks, given two host-defined blocks of data. One of the blocks is the task-specific data, and the other block is common to all or most of the tasks in the job. Separating the two enables PowerTap to minimize network traffic.

To get a job done, the host software creates the separate tasks and submits them to the PowerTap library using SubmitTask(). Subsequent calls to PTIdle() cause the work to be performed on other CPUs and/or by the local DoTask(). Task results are retrieved by calls to GetNextResult() or GetTaskResult(). Completed results and task data are available throughout the duration of the job, so there is no need to maintain queues or provide error handling for the myriad potential errors.

The basic sequence is:

    InitPowerTap()
    OpenJob()
    SubmitTask() [once for each task]
    PTIdle() and GetNextResult() or GetTaskResult() until all results are done
    CloseCurrJob()
    ClosePowerTap()

The PowerTap library and DoTask() are linked into the host software. This means the host programmer does not have to code the algorithm two different ways, depending on Gestalt results: the job will be performed regardless of the platform or environment.

Remote taps are complete, faceless, background-only (FBA) applications built from a Tap Module (provided), plus the host’s DoTask(), plus a customization resource.
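
The split between job-wide data and task-specific data can be illustrated with a small sketch. Everything here is hypothetical: the real DoTask() signature is not shown in this article, so do_task_sketch and the Sketch* block types are stand-ins, and the per-pixel computation is a placeholder rather than a fractal calculation. What the sketch does show is that the function reads only its two blocks, so tasks can run in any order, on any machine.

```c
/* Hypothetical block layouts; the real PowerTap blocks are host-defined. */
typedef struct { long width; } SketchJobBlock;                    /* common to all tasks */
typedef struct { long startLine; long endLine; } SketchTaskBlock; /* one band (inclusive) */
typedef struct { long pixelCount; unsigned long sum; } SketchResultBlock;

/* Stand-in for a host-supplied DoTask(): processes one band of lines
   using only the job block and the task block. It reads no global state
   and no other task's result, so bands are fully independent. */
static SketchResultBlock do_task_sketch(const SketchJobBlock *job,
                                        const SketchTaskBlock *task)
{
    SketchResultBlock r = { 0, 0 };

    for (long y = task->startLine; y <= task->endLine; y++) {
        for (long x = 0; x < job->width; x++) {
            /* Placeholder per-pixel work (an assumption for illustration). */
            r.sum += (unsigned long)((x + y) % 256);
            r.pixelCount++;
        }
    }
    return r;
}
```

Because the job block is identical for every task, it only needs to reach each remote machine once, while the small task block travels with each task.
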
Users of remote machines being tapped can control their Tap with a local control panel (provided). This provides on/off control as well as an adjustment for how much or little CPU time will be given to the Tap. Each tap has a customization resource which identifies the tap and provides settings for buffer sizes, CPU sharing and other things.

There are several optional calls available for obtaining statistics for the job and for individual task performance, limiting the number of participating remote Macs, and other features.

Cost

The end user incurs no additional cost. PowerTap works with all Macintosh models. There can even be a relative cost savings if the end user sets up a small number of very powerful machines and uses PowerTap to enable many people to tap into the power of those “power servers.”

The developer must license one copy of PowerTap. This entitles them to unlimited distribution as part of their product with no royalties or periodic renewal fees. The present price range is $1,200 to $2,700, depending on the number of remote taps that can be used.

Sample Code

The sample fractal code is below. The DoTask() routine is not shown; however, it would consist of a routine that takes a pointer to the job data and the task data. The PowerTap libraries would be responsible for sending the task data and job data across the network to and from each tap.

    #define kNumTasks 20
    ...
    err = InitPowerTap( kOnlyGuest + kUseGenesisAPI );

    // Allocate the initial request param block that gets sent to each task
    jobLen = sizeof( JobBlock );
    theJobData = (JobBlock**) NewHandleClear( jobLen );
    (**theJobData).zc = -0.75;
    (**theJobData).zd = 0;
    (**theJobData).step = 0.0001;
    (**theJobData).escape = 50.0;
    (**theJobData).width = 1500;

    // choose a job number that will be unique
    theJobNum = TickCount();
    err = OpenJob( theJobNum, (Handle) theJobData, jobLen );

    taskLen = sizeof( TaskBlock );

    // submit all of the tasks. they queue ~ LIFO.
    // hard-code the number of tasks as kNumTasks for the sample.
    for ( i = kNumTasks - 1; i >= 0L; i-- )
    {
        taskData = (TaskBlock**) NewHandle( taskLen );
        if ( taskData != NULL )
        {
            (**taskData).startLine = i * 1500 / kNumTasks;
            (**taskData).endLine = (i+1) * 1500 / kNumTasks - 1;
            err = SubmitTask( i, (Handle) taskData, taskLen, NULL );
        }
    }

    // act on the task results as they come in…
    nDone = 0L;
    while ( nDone < kNumTasks )
    {
        // get all of the results that are ready now.
        while ( GetNextResult( &taskNo, (Handle*) &resultHand, &rLen, macName ) )
        {
            DrawResult( taskNo, (ResultBlock**) resultHand, macName );
            nDone++;
        }

        // call PTIdle to give PowerTap some time to juggle the tasks.
        if ( PTIdle( 2L ) != noErr )
            break;
        WaitNextEvent( everyEvent, &theEvt, 2L, NULL );
    }

    // we are done now.
    ClosePowerTap();
    DisposeHandle( (Handle) theJobData );
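
The band arithmetic in the submission loop can be checked in isolation. The sketch below reproduces the startLine/endLine formulas from the sample in portable C and verifies that the bands are contiguous and cover every line exactly once. Band, band_for_task and bands_cover_all_lines are hypothetical helper names introduced for this illustration; they are not PowerTap calls.

```c
#include <stdbool.h>

/* One band of image lines, inclusive at both ends. */
typedef struct { long startLine; long endLine; } Band;

/* Same integer arithmetic as the sample's submission loop, with the
   hard-coded 1500 replaced by a totalLines parameter. */
static Band band_for_task(long i, long numTasks, long totalLines)
{
    Band b;
    b.startLine = i * totalLines / numTasks;
    b.endLine = (i + 1) * totalLines / numTasks - 1;
    return b;
}

/* Check that bands 0..numTasks-1 are contiguous, do not overlap, and
   together cover lines 0..totalLines-1. A band may be empty
   (endLine < startLine) when there are more tasks than lines. */
static bool bands_cover_all_lines(long numTasks, long totalLines)
{
    long expectedStart = 0;

    for (long i = 0; i < numTasks; i++) {
        Band b = band_for_task(i, numTasks, totalLines);
        if (b.startLine != expectedStart)
            return false;
        expectedStart = b.endLine + 1;
    }
    return expectedStart == totalLines;
}
```

With the sample’s values (1500 lines, 20 tasks) each band is 75 lines: task 0 covers lines 0 to 74 and task 19 covers lines 1425 to 1499. Because each band’s end is computed from the next band’s start, the integer division never leaves a gap even when the line count is not a multiple of the task count.
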